
Prerequisites

Before you begin to work with Databricks Lakehouse (Delta) as a target in Qlik Replicate, make sure the following prerequisites have been met.

Client prerequisites

Driver

When Replicate Server is running on Windows or Linux, download and install Simba Spark ODBC Driver 2.6.22 on the Qlik Replicate Server machine.

Replicate on Linux

When Replicate Server is running on Linux, you also need to add the following section to the /etc/odbcinst.ini file:

[Simba Spark ODBC Driver]
Description=Amazon Hive ODBC Driver (64-bit)
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so
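
A quick way to confirm that the driver manager picks up this entry is to list the registered ODBC drivers, for example with pyodbc. This is an optional check, not part of the Replicate configuration, and assumes the pyodbc package is installed:

    import pyodbc

    # The list should include the entry added to /etc/odbcinst.ini above
    print(pyodbc.drivers())   # expect "Simba Spark ODBC Driver" in the output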

Performance and cloud services usage optimization

To optimize both cloud service usage and overall performance, the change processing mode must be set to Batch optimized apply in the Change Processing Tuning tab. It is also strongly recommended to enable the Apply batched changes to multiple tables concurrently option in the same tab.

Information note

When the Apply batched changes to multiple tables concurrently option is selected, the option to set a Global Error Handling policy will not be available. Also, some of the task-specific error handling defaults will be different.

Storage access

Databricks SQL compute must be configured to access cloud storage. For instructions, see the vendor’s online help.

Permissions and access

  • The time on the Qlik Replicate Server machine must be accurate.
  • Databricks table permissions: Replicate requires permissions to perform the following operations on Databricks tables: CREATE, DROP, TRUNCATE, DESCRIBE, and ALTER.
  • In the Access Control (IAM) settings for the ADLS Gen2 file system, assign the “Storage Blob Data Contributor” role to Replicate (AD App ID). It may take a few minutes for the role to take effect.
  • To connect to a Databricks cluster via ODBC, Replicate users must be granted the "Can Attach To" permission in their Databricks account.
  • A valid security token is required to access Databricks. The token should be specified when configuring the Databricks ODBC Access fields in the endpoint settings. A connection sketch illustrating both requirements follows this list.
  • When configuring a new cluster with Microsoft Azure Data Lake Storage (ADLS) Gen2, the following line must be added to the "Spark Config" section.

    spark.hadoop.hive.server2.enable.doAs false

  • To be able to access the storage directories from the Databricks cluster, users need to add a configuration (in Spark Config) for that Storage Account and its key.

    Example:  

    fs.azure.account.key.<storage-account-name>.dfs.core.windows.net <storage-account-access-key>

    For details, refer to the Databricks online help at: https://docs.databricks.com/clusters/configure.html#spark-configuration

  • As a best practice, do not use the root location (/user/hive/warehouse/) for the Databricks database, as doing so may impact performance.
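
The following is a minimal connection sketch, not part of the Replicate configuration, that can be used to confirm the ODBC prerequisites above (the installed Simba Spark ODBC Driver, the "Can Attach To" permission, and the security token). It assumes the pyodbc package; the host, HTTP path, and token values are placeholders that must be taken from your own Databricks workspace:

    import pyodbc

    # Placeholder values - take Host, HTTPPath, and the personal access token
    # from your own Databricks workspace and cluster.
    conn_str = (
        "Driver=Simba Spark ODBC Driver;"
        "Host=YOUR_WORKSPACE.cloud.databricks.com;"
        "Port=443;"
        "HTTPPath=YOUR_HTTP_PATH;"
        "SSL=1;"
        "ThriftTransport=2;"   # HTTP transport
        "AuthMech=3;"          # user name/password (token) authentication
        "UID=token;"
        "PWD=YOUR_ACCESS_TOKEN"
    )

    conn = pyodbc.connect(conn_str, autocommit=True)
    print(conn.cursor().execute("SELECT 1").fetchone())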

Staging permissions

Different permissions are required depending on the Storage type selected when configuring staging. For more information, see Setting general connection properties.

Amazon S3

  • You must have an Amazon S3 bucket that is accessible from the Replicate Server machine.

    For information on signing up for Amazon S3, see http://aws.amazon.com/s3/.

  • Replicate connects to AWS using SSL. This requires an appropriate CA certificate to reside on the Replicate Server machine; otherwise, the connection will fail. The purpose of the CA certificate is to authenticate the ownership of the AWS server certificate.

    On Windows, the required CA certificate is always present whereas on Linux it may sometimes be missing. Therefore, if you are using Replicate for Linux, make sure that the required CA certificate exists in the following location:

    /etc/pki/tls/certs/ca-bundle.crt

    If it does not exist, the simplest solution is to copy the certificate bundle from another Linux machine.

  • Bucket access credentials: Make a note of the bucket name, region, access key, and secret access key. You will need to provide them in the Qlik Replicate Databricks Lakehouse (Delta) target settings.
  • Bucket access permissions: Qlik Replicate requires the following bucket access permissions:

     
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Stmt1497347821000",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::YOUR_BUCKET_NAME"
                ]
            },
            {
                "Sid": "Stmt1497344984000",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::YOUR_BUCKET_NAME/target_path",
                    "arn:aws:s3:::YOUR_BUCKET_NAME/target_path/*"
                ]
            }
        ]
    }
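
    As an optional pre-flight check (outside of Replicate), the policy above can be exercised with a short script. The sketch below assumes the boto3 package and uses placeholder bucket, region, and credential values; it performs the same operations the policy grants:

    import boto3

    # Placeholder values - use the bucket, region, and keys that will be entered
    # in the Databricks Lakehouse (Delta) target settings.
    s3 = boto3.client(
        "s3",
        region_name="us-east-1",
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    )

    bucket = "YOUR_BUCKET_NAME"
    key = "target_path/replicate_access_check.txt"

    s3.get_bucket_location(Bucket=bucket)              # s3:GetBucketLocation
    s3.list_objects_v2(Bucket=bucket, MaxKeys=1)       # s3:ListBucket
    s3.put_object(Bucket=bucket, Key=key, Body=b"ok")  # s3:PutObject
    s3.get_object(Bucket=bucket, Key=key)              # s3:GetObject
    s3.delete_object(Bucket=bucket, Key=key)           # s3:DeleteObject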
    

Google Cloud Storage

The JSON credentials specified in the endpoint's Staging settings must be for an account that has read and write access to the specified bucket and folder.
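
A quick way to confirm this access (outside of Replicate) is to perform a write, read, and delete in the staging location with the same JSON key file. The sketch below assumes the google-cloud-storage package and uses placeholder bucket and folder names:

    from google.cloud import storage

    # Placeholder paths - use the JSON key file, bucket, and folder that will be
    # specified in the endpoint's Staging settings.
    client = storage.Client.from_service_account_json("service-account-key.json")
    bucket = client.bucket("YOUR_BUCKET_NAME")

    blob = bucket.blob("staging_folder/replicate_access_check.txt")
    blob.upload_from_string("ok")    # requires write access
    print(blob.download_as_text())   # requires read access
    blob.delete()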

Microsoft Azure Data Lake Storage (ADLS) Gen2

The Application Registration Client ID specified in the endpoint's Staging settings must have write access to the specified ADLS storage staging folder.
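
Write access can be verified (outside of Replicate) with the registered application's credentials. The sketch below assumes the azure-identity and azure-storage-file-datalake packages; the tenant, client, storage account, container, and folder values are placeholders:

    from azure.identity import ClientSecretCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder values - use the Application Registration details and the ADLS
    # Gen2 staging location configured in the endpoint's Staging settings.
    credential = ClientSecretCredential(
        tenant_id="YOUR_TENANT_ID",
        client_id="YOUR_CLIENT_ID",           # Application Registration Client ID
        client_secret="YOUR_CLIENT_SECRET",
    )
    service = DataLakeServiceClient(
        account_url="https://YOUR_STORAGE_ACCOUNT.dfs.core.windows.net",
        credential=credential,
    )

    file_system = service.get_file_system_client("YOUR_CONTAINER")
    file_client = file_system.get_file_client("staging_folder/replicate_access_check.txt")
    file_client.upload_data(b"ok", overwrite=True)   # requires write access
    file_client.delete_file()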
